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Abstract 

In this paper, we extend a previously developed recursive entropic segmentation scheme for applica- 
tions to biological sequences. Instead of Bernoulli chains, we model the statistically stationary segments 
in a biological sequence as Markov chains, and define a generalized Jensen-Shannon divergence for 
distinguishing between two Markov chains. We then undertake a mean-field analysis, based on which 
we identify pitfalls associated with the recursive Jensen-Shannon segmentation scheme. Following this, 
we explain the need for segmentation optimization, and describe two local optimization schemes for 
improving the positions of domain walls discovered at each recursion stage. We also develop a new 
termination criterion for recursive Jensen-Shannon segmentation based on the strength of statistical 
fluctuations up to a minimum statistically reliable segment length, avoiding the need for unrealistic null 
and alternative segment models of the target sequence. Finally, we compare the extended scheme against 
the original scheme by recursively segmenting the Escherichia coli K-12 MG1655 genome. 
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I. Introduction 

Large-scale genomic rearrangements, such as transpositions, inversions, and horizontal gene 
transfer (HGT), play important roles in the evolution of bacteria. Biological functions can be 
lost or gained in such recombination events. For example, it is known that virulent genes are 
frequently found near the boundaries of HGT islands, suggesting that virulence arise from the 
incorporation of foreign genetic material [l]-[3]. In a recent essay, Goldenfeld and Woese 
argued that the mosaic nature of bacterial genomes resulting from such large-scale genomic 
rearrangements requires us to rethink familiar notions of phylogeny and evolution [4]. As a first 
step in unraveling the complex sequence of events that shape the probable evolutionary history 
of a bacterium, we need to first identify the recombination sites bounding recombined segments, 
which are frequently distinguishable statistically from their flanking sequences. 

We do this by modeling the native and recombined segments in a genome as stationary Markov 
chains. The boundaries between such statistically stationary segments (or domains) are called 
change points in the statistical modeling literature, or domain walls in the statistical physics 
literature. Given a nucleotide or amino acid sequence of length A^, the problem of finding 
M segments generated by P stationary Markov chains is called segmentation [5], [6]. Many 
segmentation schemes can be found in the literature (see minireview by Braun and MuUer [7]). 
For M ^ P both unknown, Gionis and Mannila showed that finding the optimal segmentation 
for a given sequence is NP-hard [8]. Therefore, some segmentation schemes assume P = M, 
while others assume that P is small, and known beforehand. 

Of these, the recursive segmentation scheme introduced by Bernaola-Galvan et al. [9], [10] 
is conceptually appealing because of its simplicity. In this scheme, a given sequence is recur- 
sively partitioned into finer and finer segments — all modeled as Bernoulli chains — based 
on their Jensen-Shannon divergences. The unknown number of segments M (assumed to be 
equal to the number of segment types P) is then discovered when segmentation is terminated 
based on an appropriate statistical criterion. In this paper, we describe our extensions to this 
recursive segmentation scheme. In Sec. HIl we explain how Markov chains model the short- 
range correlations in a given sequence better than Bernoulli chains, and thereafter generalize 
the Jensen-Shannon divergence to distinguish between two Markov chains. In Sec. Hill we carry 
out a mean-field analysis to better understand the recursive segmentation scheme and its pitfalls. 
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before describing two local segmentation optimization algorithms for improving the statistical 
significance of domain walls in Sec. |IVl We also develop in Sec. |V]a new termination criterion, 
based on the intrinsic statistical fluctuations of the sequence to be segmented, for the recursive 
segmentation scheme. Finally, we compare our extended scheme against the original scheme by 
recursively segmenting the Escherichia coli K-12 MG1655 genome in Sec.|VIl before concluding 
in Sec. IWl 

II. Modeling Segments As Markov Chains 

In the earliest recursive segmentation scheme proposed by Bemaola-Galvan et al. [9], [10], 
the divergence between 1-mer statistics from two or more subsequences of a given sequence is 
examined. These subsequences are modeled as Bernoulli chains (equivalent to Markov chains 
of order K = 0), even though it is well known that biological sequences exhibit dinucleotide 
correlations and codon biases [11]-[15]. Later versions of the recursive segmentation scheme 
examine higher order subsequence statistics, so as to take advantage of different codon usage in 
coding and noncoding regions [16]-[18], but these are still assumed to be drawn from Bernoulli 
chains, albeit with extended alphabets. The first study we are aware of modeling subsequences 
as Markov chains for recursive segmentation is the work by Thakur et al. [19]. 

In this section, we will explain why the observed dinucleotide frequences and codon biases 
in biological sequences can be better modeled by Markov chains of order K > 0, compared to 
Bernoulli chains with the same high order statistics. We will then generalize the Jensen-Shannon 
divergence, so that it can be used in entropic segmentation schemes to quantify the statistical 
difference between Markov chains of order K > 0. Finally, we discuss the added modeling 
complexities associated with using Markov-chain orders that vary from segment to segment, and 
change when segments are further divided. 

A. Markov Chains Versus Bernoulli Chains 

Given a sequence x = xiX2---xn, where the symbols Xi are drawn from an alphabet 
<S = {as}g^i containing S letters, we want to model x as being generated sequentially from a 
single stationary stochastic process. In the bioinformatics literature, x is usually modeled as a 
Bernoulli chain or as a Markov chain. For a Bernoulli chain, the symbols are obtained from 
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N independent trials, governed by the state probabilities 

P{xi = as) = P{as) = P., (1) 
whereas for a Markov chain of order K, the probabiUty 

p{xi = as\xi_i = ati,.. .,Xi_K = atx) (2) 

of finding the ith symbol to be conditioned by the K symbols preceding it. We call 

these the transition probabilities 

p{as\at^ ■ ■ ■ ati,) ^ Pts (3) 

of going from the K-mer atj ■ ■ to the 1-mer as. For the rest of this paper, we use the 
shorthand t ^ s to represent the transition ■ ■ ■ ag. 

A Bernoulli chain over the alphabet S is equivalent to a Markov chain of order K = 0, and 
is completely uncorrected, in the sense that P{xi,Xj) — P{xi)P{xj). Markov chains of order 
K > 0, on the other hand, contain short-range correlations, the scale of which is set by K, but 
no long-range correlations. Therefore, if the target sequence x contains short-range correlations, 
a Markov chain of nonzero order models x better than a Bernoulli chain over the the same 
alphabet. It is also possible to capture short-range correlations in x using Bernoulli chains over 
extended alphabets. For example, a Markov chain of order K over S and a Bernoulli chain over 
the extended alphabet S^~^^ can both be used to model the {K + l)-mer statistics of x. They 
are, however, not equivalent. 

Let fts be the number of times the {K + l)-mer • • • at^as appears in x. To model x as a 
Markov chain of order K over S, we must use these {K + l)-mer counts to make maximum- 
likelihood estimates 



fi 



Pts = ^^/^ (4) 

of the S^{S — 1) independent transition probabilities, subject to the normalization 

s 

tes^ s=i 

i.e. every sequence position in x contributes one count to the model estimation. A Bernoulli 
chain over the extended alphabet 5^+^, on the other hand, contains (5"^+^ — 1) independent 
state probabilities — {S^ — 1) independent parameters more than the Markov chain of order K. 
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Besides having many more independent parameters, there is also the subtle question of 
how to estimate the state probabilities, if we are to model x using the Bernoulli chain over 
^K+i |.]^g g|.^|.g probabilities {Pts} were given, we would generate a Bernoulli chain by 
sequentially appending + l)-mers drawn according to {-Pts}- This suggests that we partition 
the observed sequence x into nonoverlapping {K+l)-mers, and use their counts fl^ to determine 
the maximum-likelihood state probabilities 



However, only one in (K+l) sequence positions in x contributes a count to the model estimation. 
The result is that such a Bernoulli chain model of x would be much less statistically significant 
than a order-X Markov chain model of x, since we are using a smaller number of counts to 
estimate a larger number of parameters. Alternatively, we can note the fact that there are (K+l) 
different ways to partition x into nonoverlapping (K + l)-mers. If we combine the counts from 
these different partitions, then every sequence position contributes one count to model estimation, 
just as for the Markov chain model. 

No matter how we perform the model estimation, two adjacent {K + l)-mers in a Bernoulli 
chain over 5^+^ are guaranteed to be uncorrelated. This means that correlations at the (K+l)- 
mer level in x can only be partly captured by a maximum-likelihood Bernoulli chain over 
In contrast, a maximum-likelihood Markov chain of order K over S will capture most of the 
{K + l)-mer correlations in x. We can ensure that most or all of the {K + l)-mer correlations in 
X are captured by a Bernoulli chain model, by going to even larger extended alphabets, but as 
we have seen above, the statistical quality of the model deteriorates rapidly as we try to model 
N counts with an exponentially increasing numbers of parameters. Based on the discussions 
above comparing Markov chain and Bernoulli chain models, we argue that a Markov chain of 
order X is a more compact model of the {K + l)-mer statistics of an observed sequence x. 

B. Generalizing the Jensen-Shannon Divergence 

After the maximum-likelihood model of an observed sequence x is determined, we can 
compute its sequence likelihood within the model. This is the probability of getting x, if we 
use the maximum-likelihood model to generate random sequences of length N. For a Bernoulli 



fts 



(6) 



E 
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chain model of x over S, we find the sequence likelihood to be given by 

s ^ 



where fs is the number of times the 1-mer ag appears in x. 

Now, if we suspect that x actually comprises two statistically stationary segments x^ = 
xiX2---Xi and xr = Xi+iXi^2 ■ ■ ■ xn, with domain wall at the cursor position i, we must 
determine one maximum-likelihood Bernoulli chain model for x^, and another for xr. This 
involves tallying the 1-mer counts and for xl and xr respectively, and then estimating 
the maximum-likelihood state probabilities 



J s pR J s 

ES fL'' * V^-S fR' 

s'=lJs' Z^s'=lJs' 



= Pf = -s^. (8) 



In this two-BemouUi-segment model of x, the sequence likelihood is given by 

P2M=fi{ps'y^ {ps'^y^- (9) 

s=l 

Because we have more free parameters in the two-segment model to fit 1-mer statistics in the 
observed sequence x, we have P2{i) > Pi for all i. The Jensen- Shannon divergence [20] 

A(^) = log = ^ [-/, log P, + ff log P,^ + Jf log Pf] > 0, (10) 

a symmetric variant of the relative entropy known more commonly as the Kullback-Leibler 
divergence, is then a quantitative measure of how much better the two-segment model fits x, 

compared to the one-segment model, subject to the constraints 

s 

s=l 

In the literature, the Jensen-Shannon divergence is only defined for proper distributions {P^}, 
{P/^}, and {Ps}, for which 

s s s 

s=l s=l s=l 

When we want to compare the (i^-l-l)-mer statistics of the subsequences x^, and xr, by modeling 

these as Bernoulli chains, the expression for the Jensen-Shannon divergence becomes 

s 

^(^) = E E [-/t^ log + log A^. + fu log A^] , (13) 
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comparing the proper distributions {P^^} and {P^} against {Pts}, which satisfy the normaliza- 
tion condition 

As we have explained, the {K+ l)-mer statistics of x^, xr, and x are better modeled as order- 
K Markov chains over S. These maximum-likelihood one-segment and two-segment Markov- 
chain models are determined by tallyng the transition counts fts, /t^, and f^, and estimating 
the transition probabilities pts, Pts^ and according to Eq. @). However, the full sets {ptg}, 
{Pts}, and {pts} of transition probabilities are not proper distributions, because they sum to 

te5^ s=i te5^ s=i tes^ s=i 

Rather, we must think of them as 5*^ sets of proper distributions, in which case the Jensen- 
Shannon divergence that simultaneously compares all distributions generalizes to 

s 

^(^) = E E [-/t^ logPt. + ft ^ogPt + hi logpf,] . (16) 

te5A' s=i 

A similar expression for A(i) was derived by Thakur et al. [19]. 

Just as for Bernoulli-chain models, the Jensen-Shannon divergence in Eq. (fT6l) for Markov- 
chain models of x^, xr, and x, is also the log ratio of the one-segment and two-segment sequence 
likelihoods 

s s 

^i(x) = n n (^t^)^*^ ' ^^(x, = n n ip'sY'^ k)^*^ ' (i^) 

tG<s^ s=i tes'^ s=i 

respectively. Here let us note that the summary of transition counts {fts, /t^, /t^} is satisfied by 
more than one sequence, but all these sequences have the same sequence likelihoods within the 
maximum-likelihood Markov chain models. 

C. Using Variable Markov-Chain Orders for Segmentation 

When all the segments in a sequence are modeled as Bernoulli chains, we need to determine 
only the optimal domain wall positions {ii, Z2, • • • , U/}- When the segments are modeled as 
Markov chains, we need also decide what Markov-chain orders {Ki, K2, . . . , K^^i} to use for 
the segments. These two problems are coupled, and we shall understand in this subsection how 
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to solve them, within the recursive segmentation framework, by considering the refinement of a 
sequence into two segments. 

We start by considering the one-segment Markov-chain model of the sequence x, which serves 
as the null model against which the two-segment Markov-chains model is compared. In principle, 
we can fit x to a one- segment Markov-chain model of any order K. The one- segment sequence 
likelihood Pi{K) given in Eq. (flTI) increases with K, because of the exponentially increasing 
number of free parameters to fit the observed counts. To perform segmentation, we want K to 
be as large as possible, so that we can distinguish segments differing only in their high-order 
statistics. To determine the maximum Markov-chain order K* justified by the statistics of x, 
we compare the penalized sequence likelihoods for various K. The negative logarithms of these 
penalized sequence likelihoods are also called the information criteria 

ct> = -2\ogP^{K) + e, (18) 

where 6* is a penalty function. Common information criteria found in the literature are, the Akaike 
information criterion (AIC) [21], the Schwarz information criterion (SIC) [22], and the Bayes 
information criterion (BIC) [23]. For a Markov chain of length N, order K over an alphabet of 
S letters, these differ in their penalty functions 

S^{S-l), AIC; 
e{N, S,K) = \ i^i^+i logiV, SIC; (19) 
S^{S-l)\ogN, BIC. 

In particular, Katz showed that the BIC gives an unbiased estimate of the true order of a Markov 
chain [23], so we use the Markov-chain order K* minimizing the BIC as the order of our one- 
segment model of x. 

Since we do not know beforehand whether x contain statistically distinct segments, or whether 
the segments, if present, differ in low-order or high-order statistics, it is safest to search for them 
at K*, the maximum statistically justifiable Markov-chain order. In the recursive segmentation 
scheme described in this paper, we would compute the Jensen-Shannon divergence A(2) given 
in Eq. (fT6l) as a function of the cursor position i at order K*, and partition x into two segments 
Xi = xi...Xi* and x^ = Xj.+i---Xiv, where i* is the sequence position maximizing the 
order- /C* Jensen-Shannon divergence. Once the segments are discovered, we can compute the 
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maximum Markov-chain orders and that we would use to further partition and xr 
by minimizing their respective BICs. Because x^ and x/j are shorter than x, we naturally have 

Kl,K*ji < K*. 

Next, suppose we believe that x is indeed composed of two statistically distinct segments, 
Xi and x/j, but i* is not the best position for the domain wall between them. To find a better 
position i** for the domain wall between x^ and x/j, we can compute the two-segment sequence 
likelihoods P2ii) for x at various cursor positions i, using an order- fC^ model for x^, and an 
order- i^lj, model for x/?, and pick i** to be the sequence position maximizing P2. Alternatively, 
we can compute the Jensen- Shannon divergence spectrum A(z) at Markov-chain order K = 
mm{Kl, K]^), and accept as i** its divergence maximum. We choose K = mm{Kl, K^) for 
doing so because this Markov-chain order is statistically justifiable in both segments, and also 
has a smaller uncertainty associated with the domain wall position, as discussed in Sec. IIII-AI 

III. Mean-Field Picture of Recursive Jensen-Shannon Segmentation 

In statistical physics, intrinsic fluctuations in the properties of a physical system (for example, 
the local density in a fluid) makes its true behaviour difficult to analyze and understand. In 
many physical systems, however, a great deal about their properties can be understood from 
simplified mean-field pictures, where we ignore statistical fluctuations, and assume that these 
properties take on system-wide values. In the same spirit, we develop in this section a mean-field 
picture of the recursive Jensen-Shannon segmentation scheme proposed by Bemaola-Galvan et 
al. [9], [10], and explain the need to move or remove domain walls that have lost statistical 
significance as the segmentation is recursively refined. We start by examining in Sec. IIII-AI the 
general anatomy of a Jensen-Shannon divergence spectrum, and how transitions rare on one 
side of the cursor position give rise to 'noise' in the spectrum that cloud our understanding of 
recursive segmentation. We argue that a mean-field picture will be helpful, and thus proceed in 
Sec. IIII-Bl to define the appropriate mean-field limit, and thereafter perform mean-field analysis 
on the recursive Jensen-Shannon segmentation scheme. 

A. Influence of Rare Transitions on the Jensen-Shannon Divergence Spectrum 

As discussed earlier, the most straightforward way to determine the optimal point i* to 
cut a given sequence x = X1X2 ■ --xn into two segments is to compute the Jensen-Shannon 
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divergence spectrum A{i) between the left segment = X1X2 ■ ■ ■ Xj and the right segment 
= ■ ■ - xn for all cursor positions < i < A^, and then determine the divergence 

maximum i* such that A(z*) = maxjA(z). As we move away from i*, each step we take 
results in the left and right distributions increasing or decreasing by one transition, giving rise to 
discrete jumps SA in the Jensen-Shannon divergence. From the definition of the Jensen- Shannon 
divergence in Eq. (fT6l) . we find that 



5A 



s 

EE 

tg^A- s=l L 



Pts 



S 

EE 

tg^A' s = l 



Pts 



ts 



(20) 



for some given changes Sf^^ = —Sf^^ to the transition counts. When the cursor position is shifted 
over by one base, only a single transition t — s is affected, for which Sf^^ = ±1 = —Sf^^. 



However, all S transition probabilities associated with the K-mer dt^cit^ 
and we have to write the jump as 



OLtu- are affected. 



bp. 



ts' 



+ E 



<5/S. log + § 



Pts' 



where we demand that 5/^^, = b^s' = —bf^g. 
Noting that 

/1 



Pts 



cL,R 



fi 



fi 



L,R 



c^L,R 
OPts 



ft 



L,R 



ft 



L,R 



f 



L,R 



L,R 



and also 



fi 



L,R 

L,R ^Pts 



t ^Pts 



5/^/ - 



pL,R 
Its 
cL,R 



oft = Oftj - Pts oft 



L,R 



■~L,R r i:L,R 



Pts ft 

we simplify the expression for the divergence jump to 

s s s s 

bA = j2 sfts' ^ogpt, + ^fts' - E Pts' ^ft + E ^ftl logpf,, + 



(21) 



(22) 



(23) 



s'=l 



Pts 



-L,R 
ts 



(24) 



1. As we 



where we made use of the facts that bf^^ = —f^, bf^ = —bff-, and Yls'=iPts 
can see, for a unit shift to the right, the sign and magnitude of the jump depends only on the 
transition probabilities, but not the transition counts. Here let us distinguish between common 
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and rare states, versus common and rare transitions. Common (or rare) states are {K + l)-mers 
with high (or low) counts /t^'^, while common (or rare) transitions are {K + l)-mers associated 
with high (or low) transition probabilities p^j . 



160 180 200 220 240 




sequence position 



Fig. 1. A typical Jensen-Shannon divergence spectrum, showing the general downward trends away from the domain wall 
(dashed vertical line), along with spikes that arise from the discrete nature of the sequence. Sometimes, regions which go against 
the general trend will also appear. (Inset) A typical spike structure around the divergence maximum. 

Fig. [T] shows a typical Jensen-Shannon divergence spectrum obtained from an artificial se- 
quence of length = 400 over S = A letters, consisting of two length A''^^ = Nji = 200, K = 
segments. General downward trends left and right of the single domain wall at i = 200 are 
observed in the Jensen-Shannon divergence spectrum, though we sometimes find regions going 
against the trend locally. When this happens, it is statistically acceptable, and frequently desirable, 
to place additional domain walls in the sequence, even though in this example we know there is 
only one true domain wall. Spikes, which are large divergence jumps resulting from transitions 
rare in one of the segments, feature ubiquitously in the Jensen-Shannon divergence spectra of 
discrete sequences. 

From this analysis, we learned that the uncertainty in determining the true domain wall position 
is dictated by the competition between common and rare transitions. The common transitions, 
which are present in larger numbers, determine the average jumps ^A^ and 5A/j and hence 
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the general downward trends left and right of the domain wall. The rare transitions, on the 
other hand, are associated with spikes (local maxima) in the divergence spectrum. Specifically, 
rare transitions that are most asymmetrically distributed between the two segments are the most 
important, since they give rise to the largest spikes |(5A|inax- Moving k bases away from the 
true domain wall, we expect the Jensen-Shannon divergence to decrease by roughly /c|(5A|. If 
k is small, k\5A\ < |5A|inax, and a single maximal spike encountered within these k bases 
will bring the divergence up to a value greater than that at the true domain wall, throwing off 
the cut that we make. In contrast, a maximal spike encountered after we have moved more 
than k — |5A|niax/|(^A| bases away from the true domain wall will not be able to raise the 
divergence beyond that observed at the true domain wall. The ratio |5A|inax/|(^A| therefore 
gives a quantitative measure of the uncertainty involved in determining the position of the true 
domain wall. 

For a fixed sequence length A^, the number of rare transitions increases, while the transition 
probabilities of the rarest transitions decrease, with increasing K. We thus find more and stronger 
spikes. For a fixed Markov-chain order K, the number of rare transitions remain more or less 
constant with decreasing N, but the transition probabilities of the rarest transitions decrease 
with decreasing as a result of stronger statistical fluctuations in the transition counts. We 
therefore find stronger spikes. The proliferation of strong spikes makes segmentation unreliable, 
and also distracts from our understanding of the recursive segmentation scheme. This is where 
a mean-field picture, within which we can study the progress of recursive segmentation in the 
absence of such statistical fluctuations, would be extremely helpful. 

B. Jensen-Shannon Divergence Spectrum in the Mean-Field Limit 

In a discrete genomic sequence of nucleotides, the sequence positions i and j take on integer 
values. The frequencies of various X-mers occurring within the interval > i) are also 
integers. From the previous subsection, we understood how a spike in the Jensen-Shannon 
divergence spectrum arise when there is a unit increment for a rare transition count. Such 
rare transitions, of course, occur infrequently along the sequence. If we can distribute the unit 
increment associated with a rare transition over the interval between two such transitions, the 
statistical effect of the increment will not be piled up over a single base as a spike. This can be 
done not just for rare transitions, but for all transitions. A continuum description of the sequence 
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can then be obtained by allowing both the sequence positions and statistical frequencies to vary 
continuously, as shown in Fig. [2l 

discrete sequence positions, integer counts 

■T^HTHCT|TCT|CHCCCTTHT|TT£HT|T|T|C| 

continuous sequence positions, real counts 
Fig. 2. Going from a discrete description to a continuum description of a nucleotide sequence. 

Depending on how we break the unit increments over one base into fractional increments over 
many bases, there are many ways to redistribute the transition counts over the interval > i) 
of the sequence. If our goal is to model this interval as a stationary Markov chain of order K, 
then in the mean-field limit we distribute the {K + l)-mer statistics within [i, j) uniformly along 
the interval. In the mean-field limit so defined for [i, j), the count ■* of the transition t ^ s 
within the subinterval [z', j' > i') C is given by 

ft^''' = ^ftf\ (25) 

J 

where is the net t — > s transition count within [i, j). In this way, we remove local fluctuations 
in the (K + l)-mer statistics within the interval. 

In Fig. [3l we show as an example the Jensen-Shannon divergence spectrum for an artificial 
binary sequence consisting of ten mean-field segments. As we can see, peaks in the divergence 
spectrum occur only at domain walls, but not all domain walls appear as peaks in the divergence 
spectrum. Some domain walls manifest themselves as kinks in the divergence spectrum, while 
others, under special distributions of the segment statistics, may even have vanishing divergences. 
Performing recursive Jensen- Shannon segmentation on this ten-segment sequence, we recover 
all nine domain walls. These, however, are not discovered in the order of their true strengths 
(heights of the blue bars), measured by the Jensen-Shannon divergence of the pairs of segments 
they separate. For the example shown, the third strongest domain wall is discovered in the first 
recursion step, the second and fourth strongest domain walls in the second recursion step, and 
the strongest domain wall discovered only in the third recursion step. 
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normalized sequence position 
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 I 




0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 
normalized sequence position 



Fig. 3. The Jensen-Shannon divergence A{z) (red sohd curve) as a function of the normalized cursor position z within 
an artificial binary sequence composed of ten mean-field segments, characterized by the probabilities (left to right) P(0) = 
(0.55, 0.05, 0.20, 0.60, 0.65, 0.30, 0.45, 0.05, 0.45, 0.15). The blue bars indicate the strength of each of the nine domain walls, 
while the number at each domain wall indicate which recursion step it is discovered. (Inset) The Jensen-Shannon divergence 
A{z) (red solid curve) as a function of the normalized cursor position z within an artificial binary sequence composed of two 
mean-field segments, characterized by the probabilities -Pl(O) = 0.1 and Pr{0) = 0.9. The domain wall at z — 0.60 is indicated 
by the blue dashed vertical line. 



IV. Optimized Recursive Segmentation Scheme 

A typical bacterial genome contains on the order of ~ 10^ bases, within which we can justify 
at most M ~ 10'^ statistically stationary segments. The segmentation problem thus involves 
optimally placing M domain walls {ii, . . . , im, . . . , ^a/}. If we place no restrictions on where im 
can be, apart from the fact that it must be an integer between 1 and A^, our high-dimensional 
optimization problem lives in an enormous maximal search space 

M = {{ii, ...,im,-- • ,«m)|1 < ii, ■ ■ ■ ,im, ■ ■ ■ ,iM < N} (26) 

consisting of A^^^ ~ 10^°°° points. The actual search space is much smaller, because we must 
satisfy the constraint 1 < ii < ^2 < ■ • ■ < ^Af < and smaller still if we demand that 
im+i — im, for statistical reasons, must be larger than some minimum separation. But even this 
realistic search space is huge, and there is no good global algorithm for moving the M domain 
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walls simultaneously from an initial point in the search space, whatever objective function (e.g. 
M-segment sequence likelihood, net strengths of the domain walls, net deviation of segments 
from stationarity) we choose for the optimization problem. 

Clearly, some complexity reduction strategy is needed to solve this high-dimensional opti- 
mization problem. In the recursive Jensen-Shannon segmentation scheme of Bernaola-Galvan 
et al. [9], [10], the complexity of the full problem is reduced by breaking the search for M 
domain walls into logM stages. At each stage of the search, the computational complexity is 
further reduced by the restriction that there shall be at most one new domain wall between 
every existing pair of adjacent domain walls. We review this recursive segmentation scheme in 
Sec. IIV-A[ and explain that while this method is conceptually appealing, the mean-field analysis 
in Sec. IIII-B] tells us that the segmentation obtained at each stage of the recursion is not optimal. 
We then describe in Sec. IIV-BI two local optimization schemes, in which a single domain wall 
is moved each time, and discuss how these schemes that can be incorporated into the recursive 
Jensen-Shannon segmentation framework. 

A. The Need for Segmentation Optimization 

While not explicitly stated in their formulation, Bernaola-Galvan et al. assumed that the 
segment structures of genomic sequences are organized hierarchically, i.e. strongly distinct 
segments are long, and contain less distinct segments that are shorter, which in turn contain 
even less distinct segments that are shorter still. Within this hierarchical picture of the genomic 
sequence, the recursive segmentation algorithm listed below: 

1) for a given sequence x = xiX2 ■ ■ -x^, compute the Jensen- Shannon divergence spectrum 
A(z) at each cursor position 1 < i < N, and cut x at the divergence maximum i*, where 
A(i*) = maxj A(i), into two segments; 

2) cut each segment at its divergence maximum into two subsegments, and continue doing 
so recursively; 

3) whenever a new cut is made, check whether a termination criterion is met, 

is expected to first discover the strongest domain walls, and then progressively weaker domain 
walls. The segmentation of a given sequence will in this way be progressively refined, by 
adding new domain walls to the existing set of domain walls at every recursion, until a terminal 
segmentation corresponding to a prescribed termination criterion is obtained. 
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From the simple analysis presented in Sec. IIII-Bl we learned that the situation is not quite so 
simple. Indeed, within the mean-field picture, all domain walls will be discovered if we allow 
the recursive segmentation to go to completion. However, the effective strengths of the domain 
walls change with each recursion, with some strong domain walls becoming weak, and other 
weak domain walls becoming strong. Consequently, the domain walls are not discovered in order 
of their true strengths, and an incomplete segmentation may pick up weak domain walls, but 
miss stronger ones. For real discrete sequences subject to local statistical fluctuations, we can 
never be sure by the hypothesis testing or model selection processes that we have exhausted 
all domain walls, so getting incomplete segmentations is a very real worry. In their early paper 
[10], Roman-Roldan et al. noticed the domain wall strengths changing as the segmentation is 
recursively refined, and their solution was to reject a new cut if it causes the statistical significance 
of existing domain walls to fall below the confidence limit. This ad hoc modification of the 
termination criterion was questioned by Li in Ref. 24. 

Since the Jensen-Shannon divergence tells us how much better a segmented model fits the 
observed sequence compared to an unsegmented model, a strong domain wall improves the 
sequence likelihood more than a weak domain wall does. By retaining strong domain walls 
that have become weak in the segmentation, and rejecting new strong cuts that would cause 
existing domain walls to weaken further, we will not be picking the best M- segment model at 
each recursion. The final set of recursively determined domain walls is therefore likely to miss 
some strong domain walls, no matter how sophisticated the hypothesis testing on the statistical 
significance of each new cut, and how much care is taken to ensure that existing domain walls 
remain significant. 

Ultimately, we want our segmentation, if incomplete, to consist of the strongest possible set of 
domain walls. This can be obtained, if we trade weak domain walls for stronger ones by moving 
existing domain walls from weaker to stronger positions, or remove weak domain walls, and let 
the segmentation scheme find stronger replacements in the next recursion step. Realizing this, Li 
proposed the use of branch-merging algorithms developed in the field of recursive partitioning 
and tree-based methods, revisiting all domain walls to see whether their removal will increase 
the statistical significance of the segmentation. However, this was not done in Ref. 24, where 
this was proposed, nor in any papers to date, as far we know. Instead of removing weak domain 
walls, we will present in Sec. IIV-BI two local optimization schemes for moving domain walls 
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from weaker to stronger positions, in the spirit of relaxation methods used in optimization and 
numerical solution of partial differential equations. 

B. Segmentation Optimization Schemes 

From Fig. HI we see that moving the domain wall im changes the statistics of the two segments, 
{im-i,im) and {im,im+i), and as a result, the strengths of three domain walls, im-i, im, and 
im+i, change. Otherwise, the effect of moving im is entirely contained within the supersegment 
{im~2, im+2)- Wc Can of course compute the likelihood of the supersegment {im-2-, hri+2) directly, 
for each im, to determine the position i^ optimizing this supersegment likelihood. However, 
such a supersegment likelihood alone will not tell us whether the domain wall is statistically 
significant. This is why we fall back on the Jensen-Shannon divergence, to justify selecting the 
four-segment model of {im-2, im+2), as opposed to a null model containing fewer domain walls. 
When we move only im., there are two such null models, which suggest two ways to go about 
optimizing im- We call these the first-order segmentation optimization scheme, and the second- 
order segmentation optimization scheme. No higher-order optimization schemes are possible, if 
we allow only one domain wall to move at a time. 



h)i-l h)i h)i + l hn+2 

Fig. 4. The supersegment (im-2, im+2) and its four component segments, (im-2,im-i), (im-i,im), (im,im+i), and 
(im+i,im+2), affected by the moving of the domain wall im- 

In the first-order segmentation optimization scheme, we compute the Jensen-Shannon diver- 
gence spectrum A{i) of the supersegment {im~i, im+i), and move im to the divergence maximum 
of this supersegment. Here, the four-segment model is compared against a three-segment model, 
with segments {im-2, im-i), {im-i,im+i), and {im+i,im+2)- This is a natural comparison to make, 
but not the only one. In Ref. 25, Li et al. reported an experiment to determine the replication 
origins and replication termini of circular bacterial genomes. They cut a circular genome at 
an arbitrary position ii to make it into a linear sequence, and then determine the divergence 
maximum Z2 of this linear sequence. By varying ii, they found that 22 would remain essentially 
constant, before changing abruptly to another almost constant value. Li et al. identified these 
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two Stable divergence maxima with the replication origin and replication terminus of the circular 
bacterial genome. They also noticed that A(?2) reaches a maximum when ii is either at the 
replication origin or the replication terminus. 

This observation can be stated more generally as follows: given four fixed domain walls ii, 
12, ia, and i^, and a variable domain wall 23, such that ii < 12 < < H < the sum 
A(i2) + A(i4) of Jensen-Shannon divergences comparing the four-segment model with segments 
{ii,i2), {"12, is), and (i4,i5) against the two-segment model with segments (iijia) and 

(isjis) is maximum when ^3 is at a true domain wall position i^. Therefore, in the second-order 
segmentation optimization scheme, we would compute A(zm_i)+A(im+i) over the supersegment 
{im-2,im+2) as wc Vary i^, and move to the point maximizing the sum of divergences. As 
we can see from the null models used, the two segmentation optimization schemes are not 
equivalent. Nevertheless, we find in pilot numerical studies on real bacterial genomes that the 
segmentations they produce are always in strong agreement. 

Because these optimization schemes are local updates, they must be applied in turn to the 
domain walls {im}m=i in the segmentation. Clearly, if we move im+i after moving im, there will 
be no guarantee that im remains optimal within either schemes. Therefore, the optimization of 
the M domain walls must be iterated, until a fixed-point segmentation is obtained. Apart from 
the need to iterate the local moves, both optimization schemes can be implemented serially, or 
in parallel, right after new cuts have been made in Step 1, and before further cuts are made 
in Step 2 of the recursive segmentation. Specifically, for the first-order optimization scheme, 
which is simpler to implement and thus used exclusively for our segmentation studies, we can 
optimize all the even domain walls simultaneously, before optimizing all the odd domain walls 
simultaneously. 

V. Segmentation Termination Condition 

In the recursive segmentation scheme described in Sec. UVl the number of statistically signif- 
icant domain walls is known only after segmentation is terminated everywhere in the sequence. 
For an existing segment, this involves making a decision to stop further refinement, based on 
some termination criterion derived within a hypothesis testing framework [9], [10], or a model 
selection framework [26], [27]. The chief shortcoming of these termination criteria is that their 
common assumption that the appropriate null model is that of a statistically stationary sequence. 
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As observed by Fickett et al, there is generally less local homogenuity than necessary in the 
sequence statistics to justify such a null model [28]. 

Indeed, we saw in Fig. [3] that the mean-field divergence spectrum of the ten-segment sequence 
looks nothing like that of a two-segment sequence. Nevertheless, we should make cuts to this 
ten-segment sequence, and in the mean-field limit, keep doing so until the divergence spectra 
of the segments vanish identically. For real sequences, we find that as we segment the sequence 
at a finer and finer scale, the divergence spectrum looks less and less like that coming from 
a one-segment, or two-segment sequence. This suggests that we probably should not be doing 
hypothesis testing, or model selection between one-segment and two-segment models of a given 
sequence, but look at properties intrinsic to the statistical fluctuations. For example, in Fig. [51 
we show the divergence spectra of a long sequence (the interval (237007, 262095) in the E. coli 
K-12 MG1655 genome, bound by two tRNAs), which we clearly should segment, and a short 
sequence (the interval (259595,262095) in the E. coli K-12 MG1655 genome, consisting of the 
proAB operon [29]), which we are inclined not to further segment. The key feature that led us 
to our intuitive decision is the strength of the peak relative to the typical statistical fluctuations. 

But what distinguishes statistical fluctuations from a genuine statistical trend, for example, 
that lead to the divergence peak at i = 259595 within the interval (237007, 262095) (see top 
plot in Fig. [S])? To answer this question, let us consider a 200-segment binary sequence, where 
each segment is of length one, and has unit count for either '0' or '1'. In this binary sequence, 
we find long strings of 'O's and 'I's, as well as regions where the sequence alternates rapidly 
between 'O's and 'I's. These correspond to smooth and spiky regions in the mean-field divergence 
spectrum, shown as the black curve in the top plot of Fig. [6l respectively. Based on the mean- 
field divergence spectrum, we can think of the sequence as comprising M < 200 long and short 
segments, the shortest of which are of unit length. Let us call the divergence spectrum of a 
sequence containing unit-length segments the raw divergence spectrum A(i). 

Of course, it makes no statistical sense to talk about unit-length segments, but let us pretend 
that it is perfectly alright to have segments with length n = 2. This being the case, the n = 1 
segments must be absorbed into its longer flanking segments, or merged amongst themselves, 
to give segments that are at least n = 2 long. This is the sequence segmentation problem in a 
different guise, so there are no simple solutions. There are, however, many statistically reasonable 
ways to perform this n = 1 — n = 2 coarse graining, and one of them gives rise to the red 
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Fig. 5. (Top) The K = Jensen-Shannon divergence spectrum (black) for the interval (237007, 262095), bound by the tRNAs 
aspV (236931, 237007) and thrW (262095, 262170) on the E. coli K-12 MG1655 genome. The two peaks highlighted are i = 
239320 and i = 259595. (Bottom) The A" = Jensen-Shannon divergence spectrum (black) for the interval (259595, 262095) 
of the E. coli K-12 MG1655 genome. While (259595, 262095) is a subinterval of (237007, 262095), the bottom divergence 
spectrum is not a blow up version of the same region in the top divergence spectrum. In both plots, the red spectra are derived 
from the respective black spectra by approximate coarse graining, assuming a minimum statistically reliable segment length of 



mean-field divergence spectrum shown in Fig. [6l We can repeat this coarse-graining procedure, 
coarse graining n = 2 segments to obtain segments at least n = A long, giving the green mean- 
field divergence spectrum, and then again, coarse graining n = 4 segments to to obtain segments 
at least n = 8 long, giving the blue mean-field divergence spectrum, both of which are shown 
in the top plot of Fig. [6l 

As we can see, by making the shortest segments in the sequence longer and longer, the mean- 
field divergence spectrum becomes smoother and smoother. Once we have achieved the desired 
minimum segment length n, we can compare the coarse-grained divergence spectrum A(i) to 
the raw divergence spectrum A(?). The strength of the statistical fluctuations in the sequence 
can then be quantified by the integrated absolute difference between the two divergence spectra. 



n = 128. 




(27) 
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Fig. 6. (Top) A series of K — mean-field Jensen-Shannon divergence spectra for an artificial binary sequence of length 
A'^ = 200. At the finest scale, shown as the black divergence spectrum, the sequence is allowed to contain unit-length segments. 
When we require the sequence to contain segments no shorter than n = 2,4, 8, we obtain the red, green, and blue divergence 
spectra respectively. (Bottom) The K — Jensen-Shannon divergence spectrum (black) for an artificial binary sequence of 
length = 200, and the divergence spectra obtained from coarse graining approximately at length scales n — 2 (red), n — 4 
(green), and n = 8 (blue). 



for this given final n. To decide whether we should segment the sequence, we compare this 
integrated statistical fluctuation 6 A against the total area 

A= diA{{) (28) 
Jo 

under the raw divergence spectrum. If the ratio 5 A/ A is small, like for the top divergence 
spectrum in Fig. [51 a cut placed at the divergence maximum will be statistically significant. If 
SA/A is large, like for the bottom divergence spectrum in Fig. [51 a cut placed at the divergence 
maximum will not be statistically significant. 

In practice, coarse graining a segmentation at scale n (meaning that the shortest segments are 
at least n long) to give a segmentation at scale 2n is a time-consuming optimization problem, so 
we devise an approximate form of the test statistic 5 A/ A to quantify the strength of statistical 
fluctuations in the divergence spectrum. This approximate test statistic is constructed for a discrete 
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sequence as follows: 

1) for a given set of sequence positions {v}rli divergences {A^ = A(v)}*li, partition 
the sequence positions into a set of convex points, satisfying 

A, < A,_i + Jr±lZi^A,+i, (29) 

and a set of concave points, which fail the above convexity condition; 

2) iterate through the set of concave points, and retain the concave point v if V — V-i > n 
and ir+i — ir > n. Otherwise, we reject i^, and all convex points within n of it. The set of 
retained concave points, together with convex points that are not rejected, form a coarse- 
grained set of domain walls that are at least n apart from each other, and the divergences 
at these points give a coarse-grained divergence spectrum; 

3) multiply n by two, and if the new length scale is less than the desired final length scale 
Tic, repeat steps 1 and 2. Otherwise, stop the coarse graining process. 

In the bottom plot of Fig. [6l we show the divergence spectra obtained from coarse graining 
approximately at length scales n = 2, 4, 8. 

For real genomes, we require this approximate coarse graining to proceed until ric = 128 • 4^* 
for a quaternary sequence whose optimal Markov-chain order is K*. This is to ensure that each 
transition count would be at least 20-30, so that the maximum-likelihood transition probabilities 
can be reliably estimated. Going back to Fig. [5l we see that the Uc = 128 approximate coarse- 
grained divergence spectrum (red curve) for the long interval (237007, 262095) is very close to 
its raw divergence spectrum (black curve), whereas for the short interval (259595, 262095), the 
Uc = 128 approximate coarse-grained divergence spectrum (red curve) is significantly different 
from its raw divergence spectrum (black curve). If we compute 5 A using the approximate coarse- 
grained divergence spectra A(i) for the two intervals, we will find a small SA/A for the long 
interval, and a large 5 A/ A for the short interval. 

Based on the above discussions, we propose to use 

as an alternative termination criterion, where (SA/A)* is a user prescribed tolerance. The test 
statistic 5 A/ A requires no prior assumption on how many segments to partition the sequence 
into, but instead requires that the shortest segments used to model the sequence must yield 
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adequate statistics to reliably estimate the model parameters. We believe that 6 A/ A, which 
measures quantitatively the relative strength of statistical fluctuations up to some cutoff length 
scale Uc, is a better test statistic to use as the termination criterion, as there is no bias towards any 
particular segment model of the given sequence. Based on extensive experimenting, we find that 
{6 A/ A)* = 0.30 will stop the recursive segmentation just before individual genes are segmented. 

VI. Application to Real Genomes 

In the original recursive segmentation algorithm proposed by Bernaola-Galvan et al. [9], [10], 
it does not matter how many new cuts are added to the sequence at each recursion step — we get 
the same terminal segmentation for the same termination criterion. However, if we optimize the 
segmentation before new cuts are added, then the terminal segmentation depends very sensitively 
on the termination criterion. To benchmark the performance of the recursive Jensen-Shannon 
segmentation scheme, with and without segmentation optimization, we add only one new cut — 
the strongest of all new cuts possible — every recursion step. A series of recursive segmentations 
(with and without segmentation optimization) obtained this way is shown in Fig. |7] for the E. 
coli K-12 MG1655 genome, for 2 < M < 20 domain walls in the sequence. 

From Fig. |71 we find that in contrast to the truly hierarchical unoptimized recursive segmen- 
tations, the optimized recursive segmentations of E. coli K-12 MG1655 are almost hierarchical, 
i.e. few domain walls in segmentations containing more domain walls are shifted relative to 
segmentations containing fewer domain walls. Those few domain walls which do get shifted, 
however, can be shifted a lot. For example, we see the domain wall iiQ = 4051637 in the 
optimized segmentation with M = 10 domain walls get shifted to iio = 4469701 in the optimized 
segmentation with M = 11 domain walls {5iio = +418064). Another example is ij = 2135183 
in the optimized segmentation with M = 15 domain walls shifted to = 2629043 in the 
optimized segmentation with M = 16 domain walls (Sij = +493860). We also observed that 
domain walls 'lost' as a result of optimization frequently reappear later in the optimized recursive 
segmentation. For example, the domain wall iio = 4051637, which was shifted to iiQ = 4469701 
when a new cut was made to the optimized segmentation with M = 10 domain walls, reappears 
in the optimized segmentation with M = 29 domain walls (not shown in Fig. jV]). These are 
manifestations of what we call the context sensitivity problem, which we will carefully study in 
another paper [30]. 
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Fig. 7. Series of recursive Jensen-Shannon segmentations of the E. coli K-12 MG1655 genome (A*' = 4639675 bp), with and 
without segmentation optimization, for (top to bottom) 2 < M < 20 domain walls. In this figure, unoptimized domain walls 
are shown in red, optimized domain walls are shown in blue, and unoptimized and optimized domain walls agreeing to within 
2000 bp of each other are shown in magenta. The two domain walls that appear in the M — 2 optimized segmentation are 
close to the replication origin and replication terminus, and remain close to where they were first discovered as the recursive 
segmentation progresses. 



In the unoptimized recursive segmentation scheme, a domain wall remains at where it was 
first discovered, and its statistical significance is measured solely by its strength. However, the 
strength of existing domain walls frequently change as more and more domain walls are added 
to the segmentation, so it is not clear what kind of statistical significance to attach to a domain 
wall that becomes progressively weaker (or one that becomes progressively stronger, for that 
matter). In the optimized recursive segmentation scheme, we find that there are some domain 
walls that remain close to where they are first discovered, which we call stable domain walls, 
and those that do not as recursion proceeds. Stability with respect to segmentation refinement- 
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optimization provides an alternative measure of statistical significance, in the sense that at any 
level of segmentation, a stable domain wall is more significant than a domain wall that has 
been shifted around for the past few recursions. A domain wall that has been stable over a 
larger number of recursions is also more significant than a domain wall that has been stable 
over a smaller number of recursions. From Fig. Ul we see that stable domain walls are always 
discovered first by the optimized segmentation scheme, if not simultaneously by both schemes. 

VII. Conclusions 

To summarize, we have in this paper presented improvements to three different aspects 
of the recursive Jensen-Shannon segmentation scheme proposed by Bernaola-Galvan et al. In 
Sec. ini which touch on the modeling aspects of recursive segmentation, we explained how 
Markov chains of order K > 0, producing better fits of the higher order statistics with fewer 
independent parameters, are better models of the segments within a heterogeneous genomic 
sequence, compared to Bernoulli chains over the quaternary or extended alphabets. We then 
wrote down a generalized Jensen-Shannon divergence, given in Eq. (fT6l) . which can be used 
as an entropic measure to distinguish between two different Markov chains of the same order. 
Moving on, we described how to select a maximum Markov-chain order for the segmentation 
of a given sequence, by minimizing its Bayes information criterion (BIC), and how to optimize 
the position of the domain wall between two segments with different maximum Markov-chain 
orders. 

Next, in Sec. [nil we described how the spatial distribution of rare transitions along the sequence 
give rise to local fluctuations in the Jensen-Shannon divergence spectrum, confusing our intuitive 
picture of how recursive segmentation proceeds in a heterogeneous sequence. We argued that a 
mean-field analysis, which we then undertook, would be useful in better understanding recursive 
segmentation. We found that in the mean-field Jensen-Shannon divergence spectrum, true domain 
walls can appear either as peaks or kinks, or even have vanishing divergences. We showed that 
all true domain walls will eventually be discovered, if we allow the recursive segmentation to 
go to completion, but these will in general not be discovered in the order of their true strengths. 
Consequently, an incomplete segmentation will generically be suboptimal, because it contains 
weak domain walls that are discovered ahead of stronger ones, which are thus left out of the 
segmentation, and we argued in Sec.|IV]that there is a need to optimize the segmentation obtained 
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at every stage of the recursive entropic segmentation. We then proposed two local optimization 
schemes to move domain walls from weaker to stronger positions, which improves the statistical 
significance of an incomplete segmentation. 

In Sec. |Vl we improved on the statistical testing aspect of the recursive Jensen-Shannon 
segmentation scheme by proposing a new termination criterion. We motivated the test statistic 
associated with this new termination criterion by considering the problem of recursively coarse 
graining an initial segmentation containing very short segments to produce a final segmentation 
containing no segments shorter than a cutoff length scale 71^. We devised an algorithm for 
performing this recursive coarse graining approximately, and argued that the fraction difference 
5 A/ A in areas under the raw divergence spectrum of the initial segmentation and the smoothed 
divergence spectrum of the final segmentation, can be used as the test statistic for terminating 
recursive segmentation. The principal advantage of using this new termination criterion, which is 
a quantitative measure of the strength of sub-ric statistical fluctuations relative to the maximum 
divergence in the raw divergence spectrum, is that it requires no prior knowledge of how many 
segments to partition the sequence into, and thus has no need to assume a single-segment null 
model for the sequence. 

Finally, in Sec. |VIl we performed a benchmark study on the genome of the model bacterium 
E. coli K-12 MG1655, comparing the recursive Jensen-Shannon segmentation schemes with and 
without optimization. The optimized segmentations obtained at different stages of the recursion 
were found to be not perfectly hierarchical, because of the context sensitivity problem which 
we will investigate in another paper [30]. At the coarse levels of segmentations examined, we 
found the recursive segmentation schemes with and without segmentation agreeing on a set of 
stable domain walls, but not on the order these are discovered. We argued that the stability of 
a domain wall with respect to optimized recursive segmentation is an alternative measure of its 
statistical significance, and found that these stable domain walls are always discovered first by 
the optimized segmentation scheme, if not simultaneously by both schemes. 

Besides the context sensitivity problem which we realize plagues all segmentation schemes 
[30], the innovations and discoveries in this paper also inspired other studies. In a future paper 
[31], we will report a clustering analysis on the terminal genomic segments, obtained using the 
optimized recursive Jensen- Shannon segmentation scheme with the new termination criterion, of 
a bacterial plant pathogen. Based on this clustering analysis on a single bacterial genome, we will 
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construct its statistical genetic backbone, and identify possible horizontally transferred genetic 
islands. We will then extend the clustering analysis for comparative studies of several closely 
related bacteria, where we will detect syntenic regions, and identify a phylogenetic backbone. We 
would like to suggest that, even though the optimized recursive segmentation scheme presented 
in this paper is formulated as a method for biological sequence segmentation, it can also be 
applied in other engineering segmentation problems (for example, in image segmentation, and 
change point detection in noisy time series). 
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